Chen Yulin's Blog
3D-LLM
Posted 2025-02-13 · Updated 2026-03-15 · Review · 3 minutes read (about 505 words)


Intro

Recent works have explored aligning images and videos with LLMs, producing a new generation of multi-modal LLMs that can understand and reason about 2D images.
However, models that can reason about 3D physical space are still missing; such reasoning involves richer concepts such as spatial relationships, affordances, physics, and interaction.

To fill this gap, the paper proposes injecting the 3D world into large language models, introducing a new family of 3D-LLMs that take 3D representations (i.e., 3D point clouds with their features) as input and perform a range of 3D-related tasks.
Advantages:

  • Long-term memory of an entire scene can be stored in a holistic 3D representation, rather than in episodic partial-view observations.
  • 3D properties such as affordances and spatial relationships can be inferred directly from the 3D representation, going far beyond what language-only or 2D-image-based LLMs can do.

Challenges

  • Data acquisition: the scarcity of 3D data hinders the development of 3D foundation models, and 3D data paired with language descriptions is even harder to obtain.
    • The paper proposes a set of data-generation pipelines that produce large-scale 3D data paired with language.
  • Obtaining meaningful 3D features that align with language features for 3D-LLMs: one approach would be to train a 3D encoder from scratch with a contrastive paradigm, analogous to how 2D images are aligned with language. However, that paradigm consumes enormous amounts of data, time, and GPU resources.
    • Instead, the paper uses a 3D feature extractor that constructs 3D features from the 2D pretrained features of rendered multi-view images. Many recent vision-language models (e.g., BLIP-2, Flamingo) are trained on 2D pretrained CLIP features; since the extracted 3D features live in the same space as those 2D pretrained features, a 2D VLM can be used seamlessly as the backbone, taking 3D features as input, for efficient 3D-LLM training.
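The feature-lifting step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes pinhole cameras and nearest-pixel sampling, and the function name and array shapes are my own for the example. Each 3D point is projected into every rendered view, the 2D feature at the hit pixel is gathered, and features are averaged over all views that see the point.

```python
import numpy as np

def lift_features_to_3d(points, feat_maps, intrinsics, extrinsics):
    """Aggregate per-view 2D features onto 3D points by projection.

    points:     (N, 3) 3D points in world coordinates
    feat_maps:  (V, H, W, C) 2D feature maps, one per rendered view
    intrinsics: (V, 3, 3) pinhole camera matrices
    extrinsics: (V, 4, 4) world-to-camera transforms
    Returns (N, C) per-point features averaged over views that see each point.
    """
    V, H, W, C = feat_maps.shape
    N = points.shape[0]
    acc = np.zeros((N, C))
    cnt = np.zeros((N, 1))
    homo = np.concatenate([points, np.ones((N, 1))], axis=1)  # (N, 4) homogeneous
    for v in range(V):
        cam = (extrinsics[v] @ homo.T).T[:, :3]      # points in camera frame
        in_front = cam[:, 2] > 1e-6                  # keep points in front of the camera
        pix = (intrinsics[v] @ cam.T).T
        pix = pix[:, :2] / np.maximum(pix[:, 2:3], 1e-6)  # perspective divide
        u = np.round(pix[:, 0]).astype(int)          # column index
        r = np.round(pix[:, 1]).astype(int)          # row index
        valid = in_front & (u >= 0) & (u < W) & (r >= 0) & (r < H)
        acc[valid] += feat_maps[v, r[valid], u[valid]]
        cnt[valid] += 1
    return acc / np.maximum(cnt, 1)                  # unseen points stay zero
```

Because the output lives in the same feature space as the per-view 2D features (here just an average), any model consuming those 2D features could in principle consume the lifted 3D features as well, which is the property the paper exploits.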

TODO

3D-LLM

http://chen-yulin.github.io/2025/02/13/[OBS]Reconstruct Anything-3D-LLM/


#Robotics #Research-paper #LLM #Multi-modal #3D-Scene #Embodied-AI
